
Bugfix in distributed GPU tests and Distributed set! #3880

Merged: 55 commits merged into main on Nov 9, 2024

Conversation

@simone-silvestri (Collaborator) commented Oct 29, 2024

This PR modifies the configuration of the distributed pipeline that runs on the Caltech cluster to allow using CUDA-aware MPI.

closes #3897
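
(For reference, a minimal sketch of how a cluster's CUDA-aware system MPI is typically hooked up to MPI.jl via MPIPreferences; this is an illustration under that assumption, not necessarily the exact configuration used in this PR:)

using MPIPreferences

# Write a LocalPreferences.toml that points MPI.jl at the system's (CUDA-aware)
# libmpi instead of the default JLL-provided binary, which is not CUDA-aware.
MPIPreferences.use_system_binary()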

@glwagner marked this pull request as ready for review on October 29, 2024 16:01
@ali-ramadhan (Member) left a comment

Looks good to me!

  commands:
    - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
  agents:
-   slurm_mem: 120G
+   slurm_mem: 8G

Member:
why?

@simone-silvestri (Collaborator, Author) Nov 7, 2024:
120G is much more than these tests need. After some frustration with tests that were extremely slow to start, I noticed that agents started much more quickly when I requested less memory. I deduce that the tests run on shared nodes rather than exclusive ones, so requesting fewer resources lets us squeeze in when the cluster is busy.

Member:
Good reason. It might warrant a comment.

Project.toml (Outdated)

  [targets]
- test = ["DataDeps", "Enzyme", "SafeTestsets", "Test", "TimesDates"]
+ test = ["DataDeps", "SafeTestsets", "Test", "Enzyme", "MPIPreferences", "TimesDates"]

Member:
Was this the crucial part?

@glwagner (Member) left a comment

This looks good. For future generations, can you please write a little bit about what you tried and what ended up working? I can't tell whether all the changes are necessary, though the end result is fairly clean. Mostly I am wondering about slurm_mem. I'm also curious why we cannot call precompile_runtime inside runtests.jl and why it is necessary to call it before Pkg.test(). This has implications for the CI of other packages.

@simone-silvestri (Collaborator, Author) commented Nov 7, 2024

I think it is equivalent. I am trying to precompile inside the runtests.
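
(For context, a minimal sketch of what precompiling the CUDA runtime at the top of runtests.jl could look like; the placement and the use of CUDA.precompile_runtime here are assumptions, not a description of what this PR finally does:)

# At the top of runtests.jl (hypothetical placement)
using CUDA

# Compile the GPU runtime library up front, so the first kernel launch in the
# distributed tests does not trigger a long compilation on every rank.
CUDA.precompile_runtime()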

By the way, regaining access to the distributed GPU tests highlighted a bug in the set! function that is specific to distributed architectures, which I am fixing in this PR.

@simone-silvestri changed the title from "New strategy for defining architecture in distributed tests" to "Bugfix in distributed GPU tests and Distributed set!" on Nov 7, 2024

@simone-silvestri (Collaborator, Author) commented Nov 7, 2024

There are two distinct issues with the GPU tests:

  1. MPI was not CUDA-aware
  2. CUDA_runtime was not found in the tests

Issue (1) is solved but, unfortunately, some tests still fail stochastically because CUDA_runtime is not found.
The failures seem linked to how long after the init step a particular test starts, i.e. tests that start much later fail with a "CUDA_runtime not found" error. I am still investigating.

@simone-silvestri (Collaborator, Author) commented:

Ok, with some fiddling, CUDA now seems to be found correctly. I think this entry

  CUDA_Runtime_jll = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"

is the change that made it work.

However, I have added a failsafe option following the suggestion in
https://github.com/JuliaGPU/CUDA.jl/blob/a085bbb3d7856dfa929e6cdae04a146a259a2044/src/initialization.jl#L96
(the error we were encountering), which reloads and recompiles CUDA_Runtime_jll and restarts the Julia session.
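
(A rough sketch of what such a failsafe could look like; the trigger condition, the call to Pkg.precompile, and the exit-and-restart strategy below are assumptions, not the exact mechanism in this PR:)

using CUDA, Pkg

# Failsafe: if the CUDA runtime cannot be found, force (re)precompilation of
# CUDA_Runtime_jll and exit, so the test harness can relaunch a fresh Julia session.
if !CUDA.functional()
    Pkg.precompile("CUDA_Runtime_jll")
    exit(1)
end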

@@ -51,7 +51,7 @@ steps:
  commands:
    - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
  agents:
-   slurm_mem: 8G
+   slurm_mem: 50G

Member:
👀

Member:
We can probably reduce the memory usage of the tests, right? I think a bigger grid than needed is often used.

Collaborator (Author):
Right, the unit tests do not require much memory, but I have seen that 32G was not enough for the GPU regression tests.

Member:
They might be too big.

@navidcy added the "bug 🐞 Even a perfect program still has bugs" label on Nov 9, 2024
@navidcy merged commit bfee493 into main on Nov 9, 2024
46 checks passed

Labels: bug 🐞 Even a perfect program still has bugs
Linked issue this PR may close: The MPI we use in the distributed tests is not CUDA-aware
4 participants